Add fix for prerenderer cache-clearing on publish#4719
Conversation
Preview deployments
Host Test Results: 1 files ±0, 1 suites ±0, 1h 54m 0s ⏱️ −9m 6s. Results for commit ceeed01. ± Comparison against earlier commit df9d319.
Realm Server Test Results: 1 files ±0, 1 suites ±0, 11m 6s ⏱️ +8s. Results for commit ceeed01. ± Comparison against earlier commit df9d319.
Covers the gap that let CS-11043 ship: every existing publish-realm test does a single publish, so a republish that signals success but serves stale content has no regression net.

The new test defines a sentinel card, publishes, asserts the initial sentinel renders on the published URL, edits the source, republishes, and asserts the updated sentinel renders (and the initial one is gone).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The host-side Loader-reset approach (sticky-for-batch + the
two-flavor clearCache/resetLoaderOnly refinement) was causing
widespread CI fallout — query-field server-hydration tests,
live-update tests, file-tree navigation, code-submode preview,
multi-reindex timeouts, host memory baseline. The Loader reset
also doesn't directly address the actual root cause investigated
under CS-11043: the stale module bytes the published nyuitp2026
realm served for ~37h were sitting in Chromium's process-level
HTTP cache, not the host's in-process Loader cache.
Roll back to pre-fix state. A follow-up commit on this branch
targets the Chromium-cache layer directly via Cache-Control on
the realm-server's source/module responses.
Kept on the branch (independently useful regardless of fix
mechanism):
- packages/matrix/tests/publish-realm.spec.ts republish test
(the regression net that fills the gap which let CS-11043
ship in the first place).
- infra/ Checkly canary script + Terraform + provisioning
runbook on the checkly-publish-cs-11096 branch in the infra
repo (production monitoring).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…11043)

The publish-republish failure mode investigated under CS-11043 was Chromium's process-level HTTP cache holding stale module bytes (presentation.gts and friends) across publishes — verified by inspecting the staging nyuitp2026 realm's failed render. Source responses were sent with no `Cache-Control` and no `Last-Modified` header (verified via `curl -sI`), which lets Chromium apply heuristic caching.

That cache lives at the browser-process level, shared across all puppeteer pages on a prerender-server task. After a republish, even freshly-spawned puppeteer pages could pull old module bytes from the persistent Chromium cache for ~37h, until the prerender-server task itself rotated.

`Cache-Control: no-store` evicts the heuristic-cache vector entirely. Every source/module fetch goes back to the realm-server, which serves whatever bytes are current on EFS. Cost: no browser cache reuse for unchanged source files — acceptable because card content is prerendered into boxel_index.isolated_html by the indexer and not typically re-fetched per page view anyway.

Applied at the single place every source response passes through (`getSourceOrRedirect`'s defaultHeaders); cached-redirect entries go through a separate header map that doesn't get the new value, which is fine — 302 redirects are not the stale-bytes vector.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
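As a minimal sketch of the approach this commit describes (the names `defaultHeaders` and `buildSourceResponse` are illustrative stand-ins, not the actual realm-server API), a no-store header spread into every source response looks like:

```typescript
// Illustrative sketch: a defaultHeaders map spread into every source
// response so the browser never applies heuristic caching. Only the
// header names/values come from the commit; the surrounding code is
// a hypothetical simplification.
type HeaderMap = Record<string, string>;

const defaultHeaders: HeaderMap = {
  'content-type': 'text/plain; charset=utf-8',
  // no-store: every fetch goes back to the realm-server, which serves
  // whatever bytes are currently on disk.
  'cache-control': 'no-store',
};

function buildSourceResponse(
  body: string,
  extra: HeaderMap = {},
): { headers: HeaderMap; body: string } {
  return { headers: { ...defaultHeaders, ...extra }, body };
}

const res = buildSourceResponse('export const x = 1;');
console.log(res.headers['cache-control']); // no-store
```

Because the defaults are applied at a single choke point, any route that builds its response some other way (such as the cached-redirect path mentioned above) does not pick up the header.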
…(CS-11043)
The Cache-Control: no-store change targets Chromium's HTTP cache,
but the publish-republish staleness investigated under CS-11043
also lived a layer up: the prerender server's puppeteer pages
hold a host-app `Loader` that caches evaluated modules by URL.
After a republish swaps new bytes onto disk, that Loader would
keep handing back the OLD module on subsequent renders — the
HTTP layer's no-store header doesn't reach into the host's
module cache.
In production this manifested as nyuitp2026.boxel.site rendering
the wordmark form of presentation.gts for ~37h after publishing
the img form. The matrix republish test (added on this branch)
reproduces the same failure in the test env: the second publish's
reindex runs on the warm prerender, the Loader serves the
initial sentinel-card module, and the published URL serves the
initial sentinel even though disk has the updated source.
Fix: after the FS swap completes (and alongside the existing
DELETE FROM modules DB-cache clear), call
prerenderer.disposeAffinity({ affinityType: 'realm', affinityValue:
publishedRealmURL }). This tears down the puppeteer pages for
the affinity; the next render against the realm spawns a fresh
page that fetches modules from disk via the realm-server.
Made `disposeAffinity` optional on the Prerenderer interface
(matching the `releaseBatch?` pattern) so stub / remote
implementations aren't forced to provide it. The call is
best-effort: a thrown error is logged via log.warn but doesn't
fail the publish, since the page-pool's LRU rotation cleans up
eventually — we just want to avoid the long staleness window.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(CS-11043)
The previous commit added a publish-handler call to
`prerenderer.disposeAffinity(...)`, but in any deployment that
routes prerender requests over HTTP (which is what every real
deployment uses, and what the matrix test infra uses via
isolated-realm-server), the realm-server holds a
`RemotePrerenderer` — which didn't implement `disposeAffinity`,
so the call silently no-op'd via `prerenderer?.disposeAffinity`.
Wire the call through the same plumbing the existing
`releaseBatch` path uses:
- RemotePrerenderer.disposeAffinity() — POST /dispose-affinity
on the prerenderURL (manager or single prerender server).
Best-effort fetch with an abort timer so a stuck upstream
can't block the publish handler.
- prerender-app POST /dispose-affinity — accepts JSON:API body
with {affinityType, affinityValue}, calls into the concrete
Prerenderer's existing disposeAffinity() method (which clears
the auth cache and tears down all puppeteer pages for the
affinity).
- manager-app POST /dispose-affinity — fans the request out to
every server currently assigned the affinity, mirroring how
release-batch broadcasts. Each server runs its own local
disposal; the broadcast resolves when all targets do.
After this, the matrix republish test's second publish triggers
disposeAffinity end-to-end: realm-server publish handler →
RemotePrerenderer HTTP POST → manager fan-out → prerender server
disposes pages. The next render against the realm spawns a fresh
puppeteer page that fetches sentinel-card.gts from disk and sees
the updated sentinel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
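The best-effort POST with an abort timer described above can be sketched as follows. The endpoint path and `{affinityType, affinityValue}` payload follow the commit description, but the wrapper function, the JSON:API body shape, the content type, and the `timeoutMs` default are assumptions for illustration:

```typescript
// Hedged sketch (not the actual RemotePrerenderer code): fire a
// dispose-affinity POST that cannot block the publish handler, because
// a stuck upstream is aborted after timeoutMs and all errors are
// swallowed as best-effort.
async function postDisposeAffinity(
  prerenderURL: string,
  affinityValue: string,
  timeoutMs = 5000, // assumed default
): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(`${prerenderURL}/dispose-affinity`, {
      method: 'POST',
      headers: { 'content-type': 'application/vnd.api+json' }, // assumed
      body: JSON.stringify({
        data: { attributes: { affinityType: 'realm', affinityValue } },
      }),
      signal: controller.signal,
    });
    return response.ok;
  } catch {
    // Best-effort: the page-pool's LRU rotation cleans up eventually,
    // so a failed disposal must not fail the publish.
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

A caller would log the `false` result and continue with the publish rather than surfacing an error.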
The test keeps failing with the same shape — popup shows the
initial sentinel after the second publish — but we can't tell
which link in the chain is broken without ground-truth signals.
This commit adds checkpoint logging and direct-fetch
diagnostics so a CI failure tells us:
- whether the source-content POST actually landed (compare
GET'd source body against the updated sentinel value),
- whether the default-domain checkbox state needed re-clicking
after modal close+reopen,
- whether the publish button was actually enabled at click
time,
- whether the /_publish-realm response was observed (and its
status code if so),
- what a plain HTTP fetch of the published URL returns vs. what
the popup renders (decouples server-side staleness from
browser-cache).
Also adds the previously-missing default-domain-checkbox click
on the second-publish path. The checkbox can lose its
selection on modal close, leaving `isPublishDisabled` true and
the publish click a silent no-op. The new `isChecked()` /
`isDisabled()` checks make that visible, and the conditional
re-click avoids the failure mode.
Strictly diagnostic + bugfix. No assertion changes; if the
test still fails, the console.log lines pinpoint which of the
chain links is the actual broken one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…index (CS-11043)

Test instrumentation (commit 62b3ee9) pinpointed the actual broken link: even after disposeAffinity tears down the prerender server's puppeteer pages and the publish handler DELETEs the DB-level modules cache, the realm-server's OWN per-Realm #sourceCache still holds the pre-swap bytes. When the immediately-enqueued reindex's worker fetches modules via HTTP to this realm-server, getSourceOrRedirect returns the cached old bytes; the reindex renders against stale source; the rendered HTML lands in boxel_index.isolated_html; the published URL serves old content forever.

The Phase-3-PR-2 publish flow relied on the NodeAdapter file watcher to invalidate the realm's caches via change events, but that's an async race against the immediately-enqueued reindex. The matrix republish test's direct-HTTP diagnostic confirmed the symptom: status=202 publish response, but the published URL served `contains-initial=true contains-updated=false`.

Fix layered with the others:
- clearLocalCaches() — new public method on Realm. Bulk-invalidates #sourceCache and the module cache. Different from __testOnlyClearCaches in that it leaves the test-only transpile counter alone.
- handle-publish-realm — between upsertPublishedRealmInRegistry and enqueueReindexRealmJob, lookupOrMount the realm and call clearLocalCaches on it. For a republish (the bug case), this nukes the stale source bytes the reindex would otherwise pull. For a new publish, the mount is fresh and the call is a no-op.

Combined with the earlier commits this closes the full chain:
- Cache-Control: no-store on source responses (commit 5e3a2b3) → Chromium HTTP cache evicted.
- disposeAffinity on publish (9c2af8a + ed53ad1) → prerender server puppeteer-page host-Loader caches evicted.
- clearLocalCaches on publish (this commit) → realm-server per-Realm #sourceCache + module cache evicted.

Each addresses a distinct layer; together they ensure the next reindex after a republish renders against the bytes that are actually on disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The [CS-11043 ...ms] checkpoint logging + the direct-HTTP fetch
comparison + the pre-assertion sentinel-text log were all added to
diagnose where in the publish-republish chain things were
breaking. They worked — they pinpointed the realm-server
#sourceCache as the actual broken link, which the
clearLocalCaches commit then fixed. With the bug closed, those
console.logs are pure noise on every successful run and clutter
the test's intent.
Removed:
- the step() helper definition
- all [CS-11043 …ms] checkpoint calls
- the direct request.get(publishedRealmURL) call before opening
the popup (was to discriminate server-side staleness from
browser-cache; both layers are addressed now)
- the pre-assertion sentinelLocator.count()/textContent() log
(standard Playwright assertion failures already show the
expected/received values)
Kept (structural improvements that make the test more robust, not
just diagnostic):
- the source-content read-back guard via request.get + expect:
if postCardSource silently fails, the published-URL assertion
below would otherwise fail with a misleading message
- the domain-checkbox isChecked() guard + conditional click:
defends against the modal losing its checkbox selection on
close/reopen, which would make the publish click a silent
no-op
- the .catch(() => null) on waitForResponse + downstream
if-guard: lets the test fall through to the URL assertion if
the response wait is transiently lost, rather than failing on
infrastructure noise
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
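The `.catch(() => null)` guard on waitForResponse that this commit keeps can be illustrated with a self-contained sketch. Here `waitForPublishResponse` is a hypothetical stand-in for Playwright's `page.waitForResponse`, so the pattern runs without a browser:

```typescript
// Sketch of the guard pattern: a transiently lost response wait is
// converted into a skippable null instead of a test failure; the
// downstream if-guard only asserts on the status when the response
// was actually observed. Names are illustrative, not the test's code.
async function waitForPublishResponse(
  shouldFail: boolean,
): Promise<{ status: number }> {
  if (shouldFail) throw new Error('response wait lost');
  return { status: 202 };
}

async function observePublish(shouldFail: boolean): Promise<string> {
  // .catch(() => null): infrastructure noise becomes null, not a throw
  const response = await waitForPublishResponse(shouldFail).catch(() => null);
  if (response) {
    return `observed status ${response.status}`;
  }
  // Fall through to the published-URL assertion instead of failing here
  return 'response wait lost; falling through to URL assertion';
}

observePublish(false).then(console.log);
observePublish(true).then(console.log);
```

The design choice is that the response wait is a diagnostic aid, not the assertion itself; the published-URL check remains the test's real oracle.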
are there tests that we can add that demonstrate these new capabilities--like the prerenderer's new dispose affinity endpoint?
if (prerenderer?.disposeAffinity) {
  try {
    await prerenderer.disposeAffinity({
      affinityType: 'realm',
      affinityValue: publishedRealmURL,
    });
  } catch (e) {
    log.warn(
      `disposeAffinity failed for ${publishedRealmURL}: ${
        e instanceof Error ? e.message : String(e)
      } — continuing with publish; stale Loader cache may persist until LRU rotation`,
    );
  }
}
this seems like it might be overkill. there is a clearCache: true option on the /render route of the prerenderer that was created specifically for this purpose: to clear the loader cache for the affinity. this is how we can use the same tab to handle both code changes and instance changes and always use the most recent code. When we do this we destroy the browser context for the realm in the prerenderer, which is basically like throwing out the baby with the bath water. I'd be more interested to know why this option is not working for you
also, hopefully the etag should be working for the module updates. if the etag wasn't working i could see how disposing of the affinity would yield a working result, as that would destroy the browser context and thus force the browser to refetch the modules even if the etag didn't change.
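The etag point in the comment above can be sketched as a self-contained model of content-hash revalidation. The SHA-256 hashing, helper names, and truncation are assumptions for illustration; only the idea of a content-hash-based etag comes from the discussion:

```typescript
// Sketch of why content-hash etags make revalidation safe: a 304 is
// only correct when the client's If-None-Match matches the hash of
// what the server would serve NOW. If the server hashes stale cached
// bytes, it returns 304 for content that no longer matches disk.
import { createHash } from 'node:crypto';

const etagFor = (bytes: string): string =>
  `"${createHash('sha256').update(bytes).digest('hex').slice(0, 12)}"`;

function revalidate(ifNoneMatch: string, currentBytes: string): 304 | 200 {
  return ifNoneMatch === etagFor(currentBytes) ? 304 : 200;
}

const v1 = 'export const form = "wordmark";';
const v2 = 'export const form = "img";';
const clientEtag = etagFor(v1);

console.log(revalidate(clientEtag, v1)); // 304: unchanged, reuse cached bytes
console.log(revalidate(clientEtag, v2)); // 200: bytes changed, serve fresh response
```

The failure mode discussed later in this thread is the case where `currentBytes` itself came from a stale in-process cache, so both the bytes and the etag agreed with each other but not with disk.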
thanks for looking at this. Claude agreed with your assessment; we went through several iterations of trying to diagnose the problem, and that call did end up not being needed — I’ve deployed this branch since removing it and publishing still worked.
There are now realm server tests for the new behaviours, too. Can you look again?
The dispose-affinity wiring was a symptom-treatment workaround: disposing the puppeteer pages after a republish made the bug go away by spawning fresh pages whose host-Loaders had no cached modules. But the actual cause was upstream — the realm-server's per-Realm #sourceCache was returning pre-swap bytes to whichever Loader fetched modules after the swap, regardless of whether that Loader was fresh.

With Realm.clearLocalCaches() (commit ec5ddf9) now invalidating #sourceCache before the reindex enqueues, the existing IndexRunner clearCache:true-on-first-render mechanism is sufficient: the Loader reset re-fetches, the realm-server's source cache is already empty, and the response carries fresh bytes + fresh content-hash etag.

Removes:
- handle-publish-realm: prerenderer.disposeAffinity(...) call and the prerenderer destructure from CreateRoutesArgs
- runtime-common Prerenderer interface: optional disposeAffinity
- prerender-app: POST /dispose-affinity endpoint
- manager-app: POST /dispose-affinity broadcast
- RemotePrerenderer: disposeAffinity() client

The internal Prerenderer.disposeAffinity / PagePool.disposeAffinity methods stay — they're still used by LRU rotation, mid-render cancel, and the existing prerendering tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three regression tests covering the load-bearing layers of the
publish-republish fix, complementing the matrix end-to-end test:
- clearLocalCaches drops cached source bytes — warms the source
cache via two source fetches (miss → hit), calls
testRealm.clearLocalCaches(), asserts the next fetch is a miss.
Directly exercises the public surface the publish handler now
invokes after the FS swap.
- source response sets Cache-Control: no-store — asserts the
header on both miss-path and hit-path source responses. Documents
the Chromium-cache contract the publish flow now depends on; if
the header is ever dropped from defaultHeaders, Chromium will
reintroduce the heuristic-cache vector that gave nyuitp2026 a
37-hour staleness window.
- republishing reflects updated source content in boxel_index —
writes a card with title sentinel-initial-<uuid>, publishes,
waits for that title to appear in boxel_index.head_html;
rewrites with sentinel-updated-<uuid>, republishes, waits for
the new title and asserts no row still references the initial
sentinel. This is the data-layer regression for CS-11043 — if
clearLocalCaches() is regressed, the second waitUntil times out
exactly as the production bug would have it. Faster than the
matrix Playwright test and gives a clearer signal at the DB
layer specifically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
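The shape of the first regression test (warm the cache miss → hit, clear, assert a miss again) can be modeled in a self-contained sketch. `SourceCache` here is an illustrative stand-in for the realm's private #sourceCache, not the actual class:

```typescript
// Minimal model of the miss/hit/clear/miss test shape. The read()
// signature and class are hypothetical; only the behavior being
// asserted (clearLocalCaches forces the next read to miss) mirrors
// the test described above.
class SourceCache {
  #cache = new Map<string, string>();

  read(path: string, loadFromDisk: () => string): { bytes: string; hit: boolean } {
    const cached = this.#cache.get(path);
    if (cached !== undefined) {
      return { bytes: cached, hit: true };
    }
    const bytes = loadFromDisk();
    this.#cache.set(path, bytes);
    return { bytes, hit: false };
  }

  // Mirrors the public surface the publish handler invokes
  clearLocalCaches(): void {
    this.#cache.clear();
  }
}

const cache = new SourceCache();
const disk = () => 'export const sentinel = 1;';
console.log(cache.read('card.gts', disk).hit); // false (miss: first fetch)
console.log(cache.read('card.gts', disk).hit); // true  (hit: cache warmed)
cache.clearLocalCaches();
console.log(cache.read('card.gts', disk).hit); // false (miss again after clear)
```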
Two corrections from the first CI failure:
- source response sets Cache-Control: no-store: the precondition
"first fetch is a cache miss" failed because testRealm.write()
triggers the indexer, which fetches the source server-side and
warms #sourceCache before the client's first GET. Drop the warm
entry via __testOnlyClearCaches() right after write so the
assertion exercises the genuine miss-path defaultHeaders.
- republishing reflects updated source content in boxel_index:
the wait for the initial sentinel timed out because I set
`attributes.title`, which CardDef doesn't have. Title lives on
cardInfo.name (which feeds the computed cardTitle). Set
attributes.cardInfo.name instead and assert against
search_doc::text — substring-matching the jsonb-as-text means
we don't have to encode the exact path (cardInfo.name vs the
derived cardTitle), and search_doc is populated for every
indexed instance regardless of head-render outcome.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commit 5e3a2b3 added `cache-control: no-store` to defaultHeaders in `getSourceOrRedirect`, but serveLocalFile unconditionally sets `cache-control: '${public|private}, max-age=0'` after spreading defaultHeaders — the no-store value was clobbered every time. The test added in the previous commit caught this: the assertion failed because the actually-served value was `public, max-age=0`.

The reason this never mattered for production:
- The 5e3a2b3 commit attributed the 37 h staleness to Chromium's heuristic HTTP cache, but Chromium only falls back to heuristic caching when there's no explicit Cache-Control — `public, max-age=0` was already on every source response. With max-age=0, every fetch revalidates via etag.
- The etag is content-hash based (etagBase: cached.contentHash). The CS-11043 production failure was that the realm-server's own #sourceCache returned stale bytes AND a stale content-hash etag together — so revalidation returned 304 for a hash that didn't match disk. The clearLocalCaches() fix (ec5ddf9) invalidates that cache before the reindex enqueues, breaking the chain at its actual cause.

So 5e3a2b3 was based on incorrect analysis (Chromium heuristic caching) and didn't even take effect at runtime. Removing it makes the PR a single load-bearing fix (clearLocalCaches) with one test that exercises it (the republish-into-boxel_index regression) plus the unit test for clearLocalCaches itself. No dead code.

Also drops the cache-control test from the previous commit since the contract it was asserting on isn't (and now never was) real.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
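The clobbering this commit describes is a plain property of object spread: a later key assignment always wins over an earlier spread. A minimal reproduction (header values real, surrounding code illustrative):

```typescript
// Why the no-store header never reached the wire: when defaultHeaders
// is spread first and the serving code then sets its own
// cache-control, the later key silently overwrites the earlier one.
const defaultHeaders = { 'cache-control': 'no-store' };

// serveLocalFile-style construction (simplified): spread the defaults,
// then unconditionally assign a cache-control of its own.
const served = {
  ...defaultHeaders,
  'cache-control': 'public, max-age=0',
};

console.log(served['cache-control']); // public, max-age=0
```

Asserting on the actually-served header, as the dropped test did, is what surfaces this class of bug; inspecting only the defaults map would not.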
// CS-11043. Bulk-invalidate this realm's in-process byte caches.
// Called by the publish-realm handler after the FS swap, BEFORE the
// reindex enqueues — so that subsequent source reads (which the
// reindex's prerender fans out across many of) bypass any
// pre-swap bytes the realm still has in `#sourceCache` /
// `#moduleCache`. The Phase-3-PR-2 publish flow relies on the
// NodeAdapter file-watcher to pick up the swap, but that's an
// async-event race against the immediately-enqueued reindex; this
// method makes the invalidation synchronous from the publish
// handler's vantage point. Different from `__testOnlyClearCaches`
// in that it does NOT reset the transpile counter (which is
// test-only diagnostic state, unrelated to byte-correctness).
clearLocalCaches(): void {
  this.#sourceCache.clear();
  this.#dropAllModuleCacheEntries();
}
you might want to coordinate with @lukemelia here. this touches on the work to move caching out of memory in the realm server in preparation for horizontally scaled realms. if we had 2 realm servers that needed cache clearing i'm not sure how you would coordinate this.
thanks, we talked it over and came up with an approach, I’m going to merge this and implement a multi-realm server fix in CS-11153.
Publishing a realm has been inconsistent in whether the changes actually show up, which has indeed been about caching, but across multiple layers. I’m out of my depth here, but here’s Claude’s explanation (edited after PR feedback about unnecessary work):
Bug

After republish, the published URL kept serving pre-republish HTML. In production, nyuitp2026.boxel.site served the old wordmark for ~37h after the user pushed the new img form.

Root cause

The Realm class holds an in-memory #sourceCache keyed by path. After a republish:
- getSourceOrRedirect returns pre-swap bytes from #sourceCache (plus a stale content-hash etag).
- The reindex renders against that stale source, and the stale HTML lands in boxel_index.isolated_html.

The design assumed the NodeAdapter file-watcher would invalidate #sourceCache, but ENABLE_FILE_WATCHER is unset in staging/production, so it never fires.

Fix

One commit. New Realm.clearLocalCaches() method; handle-publish-realm calls it after the FS swap, before the reindex enqueue. Makes the invalidation synchronous — no race, no file-watcher dependency.
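The ordering the fix depends on can be sketched in a few lines. The function and parameter names here are illustrative stand-ins for the publish handler's actual steps, not the realm-server's API:

```typescript
// Sketch of the fix's ordering: clear the realm's in-process caches
// synchronously after the FS swap and BEFORE the reindex enqueues,
// so the reindex cannot race an async file-watcher invalidation.
interface RealmLike {
  clearLocalCaches(): void;
}

async function handlePublishRealm(
  swapFilesOntoDisk: () => Promise<void>,
  lookupOrMountRealm: () => Promise<RealmLike>,
  enqueueReindexJob: () => Promise<void>,
): Promise<void> {
  await swapFilesOntoDisk();
  const realm = await lookupOrMountRealm();
  realm.clearLocalCaches(); // synchronous: no race with the reindex
  await enqueueReindexJob(); // now reads fresh bytes from disk
}

// Usage with stubs that record the step ordering:
const order: string[] = [];
handlePublishRealm(
  async () => { order.push('swap'); },
  async () => ({ clearLocalCaches: () => { order.push('clear'); } }),
  async () => { order.push('reindex'); },
).then(() => console.log(order.join(' -> '))); // swap -> clear -> reindex
```

For a new publish the mounted realm's caches are empty, so the clear is a harmless no-op; the ordering only matters on the republish path.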